Open-Air Baseball Parks

The Impacts of Weather Conditions

Author: Jason M. Graham

Affiliation: University of Scranton

Code
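# load packages used throughout the report
library(tidyverse)  # readr, dplyr, ggplot2
library(lubridate)  # year(), month(), day()
library(patchwork)  # combine plots with + and /
library(knitr)      # kable()
# load weather and game data for Boston
# (bos_game_weather_html is assumed to hold the path or URL of the CSV file)
boston_game_weather <- read_csv(bos_game_weather_html)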
# create separate columns with day, month, and year corresponding
# to game dates; and create a binary win (1) or not (0) column
boston_game_weather <- boston_game_weather %>%
  mutate(year=year(date),
         month=month(date),
         day=day(date),
         wl_binary=ifelse(wl=="W",1,0))
# load sports venue data
pro_sports_venues <- read_csv(venues_html)
# extract baseball parks
baseball_parks <- pro_sports_venues %>%
  filter(Sport == "MLB")

# add info about open roof
baseball_parks$Open_Roof <- c("No",rep("Yes",9),
                              "No","Yes","Yes","Yes",
                              "No","No",
                              rep("Yes",7),"No","Yes",
                              "Yes","No","No","No","Yes")

# load map data for US states minus AK and HI
# (both regions must be excluded, so test membership rather than using |)
states_df <- map_data("state") %>%
  filter(!region %in% c("alaska", "hawaii"))

Executive Summary

Background

Major League Baseball (MLB) is a professional athletics organization made up of 30 baseball teams, 29 in the United States and one in Canada (see the MLB homepage). Each of the 30 teams plays in its own stadium or baseball park (“List of Current Major League Baseball Stadiums - Wikipedia”); the full list of stadiums can be found in that Wikipedia article. Of the 30 MLB stadiums, 22 are open-air (see footnote 1). Figure 1 shows a map of the MLB stadiums and uses color to designate the open-air status of each stadium.

Code
ggplot() +
  geom_polygon(data=states_df,
               aes(long,lat,group=group),
               fill="white",color="darkgrey") + 
  geom_point(data=baseball_parks,aes(x=Longitude,y=Latitude,color=Open_Roof),
             size=2.5,alpha=0.6) +
  scale_color_manual(values=c("orange","#512d6d")) + 
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        plot.title = element_text(hjust = 0.5)) + 
  xlab("") + ylab("") + ggtitle("Major League Baseball Parks")

A plot showing the location of each of the 30 MLB stadiums by its latitude and longitude, overlaid on a map of the 48 contiguous United States. Colors distinguish the open-air status of each stadium; 22 of the 30 stadiums are designated as open-air.

Figure 1: The 30 MLB park locations and their open-air status. Darker shading indicates overlap of nearby parks.

Initial Research Question

We would like to know about the impact that climate or weather may have on MLB teams that play in open-air stadiums. Given that climate and weather patterns are changing and impacting many aspects of our society and lives, and given that many MLB parks are exposed to the elements for many months of the year, it could be that changing climate and weather conditions may impact MLB. In particular, we would like to know if changing climate and weather patterns might have an effect on either team performance or game attendance.

Approach to Research

In order to address our initial research question, we take as a case study the open-air stadium Fenway Park, home of the Boston Red Sox (see the Red Sox homepage). Fenway Park is located in Boston, Massachusetts (42°20′46.5″N 71°5′51.9″W) and is the oldest MLB park, open since 1912 (“Fenway Park - Wikipedia”; see also the Fenway homepage). Figure 2 shows a view of Fenway Park, with the image taken from (“Step Inside: Fenway Park - Home of the Red Sox - Ticketmaster Blog”). Figure 3 shows the same map of MLB stadiums as Figure 1 but with Fenway Park highlighted.

Code
ggplot() +
  geom_polygon(data=states_df,
               aes(long,lat,group=group),
               fill="white",color="darkgrey") + 
  geom_point(data=baseball_parks,aes(x=Longitude,y=Latitude,color=Open_Roof),
             size=2.5,alpha=0.6) +
  geom_point(data=baseball_parks %>% filter(Abbreviation == "BOS"),
                                            aes(x=Longitude,y=Latitude),size=5,color="#512d6d") + 
  geom_label( 
    data=baseball_parks %>% filter(Abbreviation == "BOS"), # Filter data first
    aes(label="Fenway",x=-71,y=43.6),color = "black",
    fill="red") + 
  scale_color_manual(values=c("orange","#512d6d")) + 
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        plot.title = element_text(hjust = 0.5)) + 
  xlab("") + ylab("") + ggtitle("Major League Baseball Parks (Fenway Highlighted)")

Figure 3: MLB park locations and open-air status with Fenway Park in Boston, MA highlighted.

The main part of this project combines data about home games played at Fenway Park with historical Boston area weather data in an attempt to assess if changing climate and weather conditions have impacted either team performance or game attendance at Fenway, with Fenway serving as an example of an open-air baseball stadium.

Data Collection and Documentation

To address our primary question, we collected and combined two sets of data. The first data set contains information about each home game played by the Boston Red Sox at Fenway Park. The second data set contains weather data, such as temperature, humidity, and precipitable water, for Boston over the available time frame of games at Fenway.

Team Data

To obtain the data on home games played by the Boston Red Sox at Fenway Park, we used the function get_retrosheet from the R package retrosheet to download game data for a specific year and team (Douglas and Scriven 2023). We then used a function team_games to iterate over all the years in which the Red Sox played at Fenway Park, that is, 1912 to 2022 (skipping 2020 due to the COVID-19 pandemic). The code to accomplish this is in the team_home.R script available in our baseball_weather GitHub repository.
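
The year-by-year collection described above can be sketched roughly as follows. This is only an illustration and not the code in team_home.R: `team_games_sketch` and `fake_fetch` are hypothetical names, and `fetch_games` is a stand-in for whatever per-season download is used (in our project that role is played by get_retrosheet). The toy games it produces are made up.

```r
# Hypothetical sketch: collect one team's home games across seasons.
# `fetch_games` stands in for a per-season download function and must
# return a data frame with at least the columns `date` and `hm_tm`.
team_games_sketch <- function(years, team, fetch_games, skip = 2020) {
  years <- setdiff(years, skip)                 # drop seasons to be skipped
  per_year <- lapply(years, function(y) {
    games <- fetch_games(y)
    games[games$hm_tm == team, , drop = FALSE]  # keep home games only
  })
  do.call(rbind, per_year)                      # stack seasons into one table
}

# Toy stand-in that fabricates two games per season (illustration only).
fake_fetch <- function(y) {
  data.frame(date = as.Date(paste0(y, c("-04-20", "-04-21"))),
             hm_tm = c("BOS", "NYA"))
}

demo_games <- team_games_sketch(2019:2021, "BOS", fake_fetch)
demo_games
```

Passing the download function as an argument keeps the loop itself easy to check without network access.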

The retrosheet package contains a collection of tools to import and structure the single-season event, game-log, roster, and schedule data available from the Retrosheet website, which maintains play-by-play accounts of as many major league games as possible. The Retrosheet data we downloaded for the Boston Red Sox has the structure shown in Table 1.

Code
bos_home_games_url <- "https://raw.githubusercontent.com/jmgraham30/baseball_weather/main/data/boston_home.csv"

bos_home_games <- read_csv(bos_home_games_url)

bos_home_games %>%
  head(5) %>%
  kable()
Table 1: A few rows of the Retrosheet data downloaded for the Boston Red Sox home games at Fenway Park.
date        dbl_hdr  hm_tm  vis_runs  hm_runs  attendance  duration  score_diff  wl
1912-04-20  0        BOS    6         7        24000       190       1           W
1912-04-23  0        BOS    6         2        2500        145       -4          NW
1912-04-24  0        BOS    5         2        2500        135       -3          NW
1912-04-25  0        BOS    1         4        3000        118       3           W
1912-04-26  0        BOS    6         7        10000       125       1           W

Each row corresponds to a single home game at Fenway. The columns specify, for each game, the date, whether the game is part of a double-header (dbl_hdr), the home team (hm_tm), the number of runs scored by the visiting team (vis_runs), the number of runs scored by the home team, Boston (hm_runs), the game attendance, the duration of the game in minutes, the difference between the home team score and the visiting team score (score_diff), and whether Boston won (W) or did not win (NW).
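
As a small worked check of how the last two columns relate to the run totals, the sketch below recomputes them for the five rows of Table 1. The rule used for wl (a win exactly when the score difference is positive) is our assumption for illustration; it is consistent with the rows shown but is not taken from the Retrosheet documentation.

```r
# First rows of Table 1, retyped by hand for illustration.
games <- data.frame(
  vis_runs = c(6, 6, 5, 1, 6),
  hm_runs  = c(7, 2, 2, 4, 7)
)

# Home minus visitor runs; a positive value is treated as a Boston win.
games$score_diff <- games$hm_runs - games$vis_runs
games$wl <- ifelse(games$score_diff > 0, "W", "NW")
games
```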

The data set contains 8,634 rows or observations.

Weather Data

We used the R package RNCEP to download climate and weather data for the Boston area (Kanamitsu 2002). This package simplifies the access, organization, and visualization of weather data from the NCEP/NCAR Reanalysis and NCEP/DOE Reanalysis II data sets. Specifically, we used the function NCEP.gather.surface to download data for the surface temperature (air.sig995), relative humidity (rhum.sig995), and precipitable water (pr_wtr.eatm). See the documentation for RNCEP for more details and specifications of the units used. The climate and weather data we downloaded for Boston has the structure shown in Table 2.

Code
bos_weather_url <- "https://raw.githubusercontent.com/jmgraham30/baseball_weather/main/data/boston_weather_df.csv"

bos_weather <- read_csv(bos_weather_url)

bos_weather %>%
  head(5) %>%
  kable()
Table 2: A few rows of the climate and weather data downloaded for Boston.
tempK     humid    precip     date
276.3125  63.3750  5.668751   1950-04-01
280.3000  81.0625  12.656252  1950-04-02
281.2875  79.4375  15.000000  1950-04-03
285.2188  92.1250  24.950001  1950-04-04
285.9813  93.6875  27.612501  1950-04-05

The data set contains 15,622 rows or observations. Note that the recorded temperature, humidity, and precipitable water values are daily averages taken over a spatial region (between 41 and 42 degrees latitude and 289 and 290 degrees longitude). The code used to download the data is contained in the boston_weather_get.R script available in our baseball_weather GitHub repository.
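
To make the two averaging steps concrete, here is a minimal sketch on synthetic data. It assumes the downloaded values arrive as a latitude by longitude by time array with several time steps per day; that layout, the 6-hourly spacing, and all the numbers are assumptions for illustration and are not taken from boston_weather_get.R.

```r
# Synthetic 2 x 2 grid (lat x lon) with four 6-hourly time steps per day
# over two days; the values are made up for illustration.
times <- seq(as.POSIXct("1950-04-01 00:00", tz = "UTC"),
             by = "6 hours", length.out = 8)
temp_k <- array(rep(c(276, 278, 280, 278, 279, 281, 283, 281), each = 4),
                dim = c(2, 2, 8))

# Step 1: average over the spatial grid at each time step.
spatial_mean <- apply(temp_k, 3, mean)

# Step 2: average the time steps that fall on the same calendar day.
daily_mean <- tapply(spatial_mean, as.Date(times), mean)
daily_mean
```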

Combined Weather and Game Data

In order to address our initial research question, we combined the Boston weather and home game data. It is important to emphasize the following:

  • Climate and weather data is only available starting in the 1950s, while Boston Red Sox home game data goes back to 1912. Thus, we must restrict the home game data to those observations starting in the 1950s.

  • The downloaded climate and weather data contains observations for many more days than just the days on which there were Red Sox home games. Thus, we must extract those climate and weather observations corresponding to dates of games at Fenway.

All the necessary data manipulations were accomplished through functions from the dplyr package (Wickham et al. 2023). See the boston_data_wrangling.R script available in our baseball_weather GitHub repository.
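
Both restrictions listed above can be handled by a single join on the date column. The sketch below shows one plausible way to do this with dplyr's inner_join on toy tables; it is not the code in boston_data_wrangling.R, and the table contents are illustrative.

```r
library(dplyr)

# Toy stand-ins for the two tables (values are illustrative).
games <- tibble(
  date = as.Date(c("1949-09-30", "1950-04-18", "1950-04-19")),
  hm_runs = c(3, 10, 6)
)
weather <- tibble(
  date = as.Date(c("1950-04-17", "1950-04-18", "1950-04-19", "1950-04-20")),
  tempK = c(283.1, 284.85, 284.88, 285.0)
)

# inner_join keeps only dates present in both tables, which drops games
# without weather data and weather days without a home game in one step.
game_weather <- inner_join(games, weather, by = "date")
game_weather
```

Both games of a double-header share a date, so each would be matched to the same weather row.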

The combined weather and game data for the Boston Red Sox home games at Fenway Park has the structure shown in Table 3.

Code
boston_game_weather %>%
  head(5) %>%
  kable()
Table 3: A few rows of the weather and game data for the Boston Red Sox home games at Fenway Park.
date        dbl_hdr  hm_tm  vis_runs  hm_runs  attendance  duration  score_diff  wl  tempK     humid    precip    tempF     year  month  day  wl_binary
1950-04-18  0        BOS    15        10       31822       208       -5          NW  284.8500  80.2500  18.29375  53.06000  1950  4      18   0
1950-04-19  1        BOS    3         6        25425       150       3           W   284.8813  85.9375  17.17500  53.11625  1950  4      19   1
1950-04-19  2        BOS    16        7        32860       190       -9          NW  284.8813  85.9375  17.17500  53.11625  1950  4      19   0
1950-04-28  0        BOS    1         4        5333        116       3           W   282.8500  90.5625  19.40625  49.46001  1950  4      28   1
1950-04-30  1        BOS    0         19       0           131       19          W   280.4438  85.3125  21.54375  45.12876  1950  4      30   1

This is our principal data set and it contains 5,727 rows.
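
Table 3 includes a tempF column alongside tempK. The standard Kelvin to Fahrenheit conversion is sketched below; the helper name `kelvin_to_fahrenheit` is ours and is used only for illustration.

```r
# Convert a temperature in Kelvin to degrees Fahrenheit.
kelvin_to_fahrenheit <- function(temp_k) {
  (temp_k - 273.15) * 9 / 5 + 32
}

# The first row of Table 3 has tempK = 284.85.
kelvin_to_fahrenheit(284.85)
```

For the first row this gives 53.06, which agrees with the tempF value listed in Table 3.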

Exploratory Analysis and Results

The script boston_home_weather_eda.R contains the full exploratory analysis and all corresponding code.

Potential Response Variables

Figure 4 shows the distributions of runs scored by the Red Sox and by the visiting team, as well as the number of games won and not won and the difference in scores between the Red Sox and the visiting teams. The median number of runs for Boston is 5, while the median number of runs for visitors is 4.

Code
# Number of Boston runs
ra_1 <- boston_game_weather %>%
  ggplot(aes(x=hm_runs)) + 
  geom_histogram(color="#512d6d",fill="lightblue") + 
  xlab("Number of Runs by Boston") + 
  ylab("Count")
# Number of visitor runs
ra_2 <- boston_game_weather %>%
  ggplot(aes(x=vis_runs)) + 
  geom_histogram(color="#512d6d",fill="lightblue") + 
  xlab("Number of Runs by Visitors") + 
  ylab("Count")
# Boston runs - Visitor runs
ra_3 <- boston_game_weather %>%
  ggplot(aes(x=score_diff,fill=wl)) + 
  geom_histogram(color="#512d6d") + 
  scale_fill_manual(values = c("#E69F00","#009E73")) + 
  labs(x = "Boston Runs - Visitor Runs",
       y = "Count",
       fill="Win/Not")
# Win or Not Win
ra_4 <- boston_game_weather %>%
  ggplot(aes(x=wl,fill=wl)) + 
  geom_bar(color="#512d6d") + 
  scale_fill_manual(values = c("#E69F00","#009E73")) + 
  theme(legend.position = "none") +
  labs(x = "No Win or Win",
       y = "Count")

(ra_1 + ra_2) / (ra_4 + ra_3)

Figure 4: Exploratory plots for the distributions of Red Sox runs, visitor runs, score difference, and wins for Red Sox home games played at Fenway Park from 1950 to 2022 (excluding 2020).

We observe from Figure 4 that the Red Sox have won more games than they have lost and tend to win by a median of one run.

Figure 5 displays the distributions for the duration and attendance of Red Sox home games played at Fenway Park from 1950 to 2022 (excluding 2020). The median duration for a game is 170 minutes, while the median attendance is around thirty thousand. Note that Fenway Park currently seats over 37,000 (“Fenway Park - Wikipedia”).

Code
rb_1 <- boston_game_weather %>%
  ggplot(aes(x=duration)) + 
  geom_histogram(color="#512d6d",fill="lightblue") + 
  labs(x = "Game Duration",
       y = "Count")
# Game attendances
rb_2 <- boston_game_weather %>%
  ggplot(aes(x=attendance)) + 
  geom_histogram(color="#512d6d",fill="lightblue") + 
  labs(x = "Game Attendance",
       y = "Count")

(rb_1 + rb_2)

Figure 5: Exploratory plots for the distributions of game duration and attendance for Red Sox home games played at Fenway Park from 1950 to 2022 (excluding 2020).

One potentially important observation based on Figure 5 is that there are some Red Sox home games for which the recorded attendance is 0. Specifically, there are 243 such games. This value is determined using the following code:

Code
boston_game_weather %>%
  filter(attendance == 0) %>%
  nrow()
[1] 243
Code
boston_game_weather %>%
  filter(attendance == 0) %>%
  ggplot(aes(x=year)) + 
  geom_bar(color="#512d6d",fill="lightblue") + 
  labs(x="Year",y="Count",title="Observations with Zero Recorded Attendance")

Figure 6: The number of Red Sox home games by year with a recorded attendance of zero.

It’s not immediately clear why a zero is recorded for attendance at some games. This could be an error, but further investigation is required to know for certain. It might be necessary to remove the observations with zero attendance, depending on which specific questions we decide to address with models.
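
One option short of removing rows is to recode a recorded attendance of zero as missing, so that the games remain available for questions that do not involve attendance. The sketch below illustrates this on a toy vector; treating zero as missing is an assumption on our part, since we have not established why the zeros occur.

```r
# Toy attendance values, including two recorded zeros (illustration only).
attendance <- c(31822, 25425, 0, 5333, 0)

# Recode zeros as NA rather than dropping the corresponding games.
attendance_clean <- replace(attendance, attendance == 0, NA)

# Summaries then need to skip the missing values explicitly.
median(attendance_clean, na.rm = TRUE)
```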

Potential Predictor Variables

Our main interest in this project is to examine how climate and weather factors impact aspects of baseball games such as the score, outcome, duration, and attendance of games. As previously noted, we have data about humidity, precipitable water (see footnote 2), and temperature for the Red Sox home games.

Climate and Weather Variables

Figure 7 shows the distributions for humidity (as a percent), temperature (in degrees Fahrenheit), and precipitable water (in units of \(\frac{\text{kg}}{\text{m}^2}\)).

Code
w_1 <- boston_game_weather %>%
  ggplot(aes(x=humid)) + 
  geom_histogram(color="#512d6d",fill="lightblue") + 
  labs(x = "Humidity",
       y = "Count")

w_2 <- boston_game_weather %>%
  ggplot(aes(x=tempF)) + 
  geom_histogram(color="#512d6d",fill="lightblue") + 
  labs(x = "Temperature",
       y = "Count")

w_3 <- boston_game_weather %>%
  ggplot(aes(x=precip)) + 
  geom_histogram(color="#512d6d",fill="lightblue") + 
  labs(x = "Precipitable Water",
       y = "Count")

(w_1 + w_2 + w_3)

Figure 7: Exploratory plots for the climate and weather variables on dates of Red Sox home games played at Fenway Park from 1950 to 2022 (excluding 2020).

We see that the median temperature is about 65 degrees Fahrenheit, the median humidity is about 83 percent, and the median precipitable water is about 25 \(\frac{\text{kg}}{\text{m}^2}\).

Variable Relationships

We would like to determine what impact, if any, temperature, humidity, or precipitable water has had on the score, outcome, duration, or attendance of Red Sox home games over time.

Figure 8 shows the number of runs scored by the Boston Red Sox at each home game from 1950 to 2022 (2020 excluded) plotted against date, temperature, humidity, and precipitable water, respectively.

Code
hs_date <- boston_game_weather %>%
  ggplot(aes(x=date,y=hm_runs)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Date",y="Red Sox Runs")
hs_temp <- boston_game_weather %>%
  ggplot(aes(x=tempF,y=hm_runs)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Temperature (F)",y="Red Sox Runs")
hs_humid <- boston_game_weather %>%
  ggplot(aes(x=humid,y=hm_runs)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Humidity",y="Red Sox Runs")
hs_precip <- boston_game_weather %>%
  ggplot(aes(x=precip,y=hm_runs)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Precipitable Water",y="Red Sox Runs")

(hs_date + hs_temp) / (hs_humid + hs_precip)

Figure 8: The number of runs scored by Red Sox at home games at Fenway Park plotted against the game date, and climate and weather variables.

Here it is difficult to see any clear pattern of influence that date, temperature, humidity, or precipitable water have on the number of runs the Red Sox score at home games over the specified period of time. We observe that there are potential outliers in each of the plots in Figure 8.

Figure 9 shows the wins for the Boston Red Sox at each home game from 1950 to 2022 (2020 excluded) plotted against date, temperature, humidity, and precipitable water, respectively.

Code
ws_date <- boston_game_weather %>%
  ggplot(aes(x=date,y=wl,color=wl)) + 
  geom_point(size=2,alpha=0.2) + 
  scale_color_manual(values = c("#E69F00","#009E73")) + 
  theme(legend.position = "none") + 
  labs(x="Date",y="Win/Not")
ws_temp <- boston_game_weather %>%
  ggplot(aes(x=tempF,y=wl,color=wl)) + 
  geom_point(size=2,alpha=0.2) + 
  scale_color_manual(values = c("#E69F00","#009E73")) +
  labs(x="Temperature (F)",y="Win/Not", color="Win/Not")
ws_humid <- boston_game_weather %>%
  ggplot(aes(x=humid,y=wl,color=wl)) + 
  geom_point(size=2,alpha=0.2) + 
  scale_color_manual(values = c("#E69F00","#009E73")) +
  theme(legend.position = "none") +
  labs(x="Humidity",y="Win/Not")
ws_precip <- boston_game_weather %>%
  ggplot(aes(x=precip,y=wl,color=wl)) + 
  geom_point(size=2,alpha=0.2) + 
  scale_color_manual(values = c("#E69F00","#009E73")) +
  theme(legend.position = "none") +
  labs(x="Precipitable Water",y="Win/Not")

(ws_date + ws_temp) / (ws_humid + ws_precip)

Figure 9: The wins by Red Sox at home games at Fenway Park plotted against the game date, and climate and weather variables.

Again, it is difficult to see any clear pattern of influence that date, temperature, humidity, or precipitable water have on whether the Red Sox win at home games over the specified period of time.

Figure 10 shows the duration of games for the Boston Red Sox at each home game from 1950 to 2022 (2020 excluded) plotted against date, temperature, humidity, and precipitable water, respectively.

Code
ds_date <- boston_game_weather %>%
  ggplot(aes(x=date,y=duration)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Date",y="Game Duration")
ds_temp <- boston_game_weather %>%
  ggplot(aes(x=tempF,y=duration)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Temperature (F)",y="Game Duration")
ds_humid <- boston_game_weather %>%
  ggplot(aes(x=humid,y=duration)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Humidity",y="Game Duration")
ds_precip <- boston_game_weather %>%
  ggplot(aes(x=precip,y=duration)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Precipitable Water",y="Game Duration")

(ds_date + ds_temp) / (ds_humid + ds_precip)

Figure 10: The game duration for Red Sox home games at Fenway Park plotted against the game date, and climate and weather variables.

We clearly see that the duration of games has generally increased from 1950 to 2022. However, it is difficult to see any clear pattern of influence that temperature, humidity, or precipitable water have on the duration of Red Sox home games.

Figure 11 shows the attendance of games for the Boston Red Sox at each home game from 1950 to 2022 (2020 excluded) plotted against date, temperature, humidity, and precipitable water, respectively.

Code
as_date <- boston_game_weather %>%
  filter(attendance > 0) %>%
  ggplot(aes(x=date,y=attendance)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Date",y="Game Attendance")
as_temp <- boston_game_weather %>%
  filter(attendance > 0) %>%
  ggplot(aes(x=tempF,y=attendance)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Temperature (F)",y="Game Attendance")
as_humid <- boston_game_weather %>%
  filter(attendance > 0) %>%
  ggplot(aes(x=humid,y=attendance)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Humidity",y="Game Attendance")
as_precip <- boston_game_weather %>%
  filter(attendance > 0) %>%
  ggplot(aes(x=precip,y=attendance)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Precipitable Water",y="Game Attendance")

(as_date + as_temp) / (as_humid + as_precip)

Figure 11: The game attendance for Red Sox home games at Fenway Park plotted against the game date, and climate and weather variables.

We clearly see that game attendance has generally increased from 1950 to 2022. It also appears that attendance increases with temperature, humidity, and precipitable water. Some additional points include:

  1. It appears that the variance of game attendance has decreased over time, and there is a period (roughly 2005-2010) during which the variance is very low relative to other years. It would be interesting to identify a cause for this, since it suggests that Red Sox games were consistently well-attended during that time.

  2. While attendance appears to increase with increasing temperature, humidity, and precipitable water it is unlikely that an increase in temperature, humidity, or precipitable water causes more people to attend Red Sox home games. For example, attendance might be increasing over time for any number of reasons and perhaps temperature, humidity, and precipitable water are increasing with time. Further exploratory analysis might provide insight on this.
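
The second point can be illustrated with a small simulation. In the sketch below, attendance and temperature both drift upward with year, but temperature has no direct effect on attendance; all numbers are invented for illustration and are not estimates from our data.

```r
set.seed(1)
n <- 500
year <- sample(1950:2022, n, replace = TRUE)

# Both variables trend upward with year; temperature has no direct
# effect on attendance in this simulation.
temp_f <- 60 + 0.05 * (year - 1950) + rnorm(n, sd = 2)
attendance <- 15000 + 250 * (year - 1950) + rnorm(n, sd = 2000)

naive <- lm(attendance ~ temp_f)            # ignores the time trend
adjusted <- lm(attendance ~ temp_f + year)  # accounts for the time trend

coef(naive)["temp_f"]     # picks up the shared trend with year
coef(adjusted)["temp_f"]  # much closer to zero once year is included
```

A large temperature coefficient that shrinks once year enters the model is what we would expect if time, rather than weather, drives the apparent relationship.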

Further Exploratory Analysis

Figure 12 shows the number of home game wins per year for the Red Sox.

Code
boston_game_weather %>%
  group_by(year) %>%
  summarise(wins_by_year=sum(wl_binary)) %>%
  ggplot(aes(x=year,y=wins_by_year)) + 
  geom_point(size=2,color="darkgreen") + 
  geom_smooth(color="#512d6d",fill="lightblue") + 
  labs(x="Year",y="Number of Wins")

Figure 12: The number of home games at Fenway Park which the Red Sox won plotted against year.

We observe that from around 2003 to 2009, the Red Sox were consistently winning a large number of games. This could relate to our earlier observation regarding the time period of consistently high game attendance observed in Figure 11.
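
If the number of home games differs between seasons (for example, in a shortened season), the raw win count and the win rate can tell different stories. The sketch below computes both on toy data; the years and outcomes are made up for illustration.

```r
# Toy data: one longer season and one shorter one (illustrative values).
toy <- data.frame(
  year = c(rep(2001, 6), rep(2002, 3)),
  wl_binary = c(1, 0, 1, 1, 0, 1, 1, 1, 0)
)

wins <- tapply(toy$wl_binary, toy$year, sum)      # wins per year
games <- tapply(toy$wl_binary, toy$year, length)  # home games per year

by_year <- data.frame(year = as.integer(names(wins)),
                      wins = as.vector(wins),
                      games = as.vector(games),
                      win_rate = as.vector(wins / games))
by_year
```

Here the two toy seasons differ in the number of wins but have the same win rate.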

Figure 13 shows the change in both the daily and yearly average climate and weather variables over time.

Code
wa_temp <- boston_game_weather %>%
  ggplot(aes(x=date,y=tempF)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Date",y="Temperature")
wb_temp <- boston_game_weather %>%
  group_by(year) %>%
  summarise(mean_temp=mean(tempF)) %>%
  ggplot(aes(x=year,y=mean_temp)) + 
  geom_point(size=2,color="darkgreen",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="lightblue") + 
  labs(x="Year",y="Mean Temperature")

wa_humid <- boston_game_weather %>%
  ggplot(aes(x=date,y=humid)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Date",y="Humidity")
wb_humid <- boston_game_weather %>%
  group_by(year) %>%
  summarise(mean_humid=mean(humid)) %>%
  ggplot(aes(x=year,y=mean_humid)) + 
  geom_point(size=2,color="darkgreen",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="lightblue") + 
  labs(x="Year",y="Mean Humidity")

wa_precip <- boston_game_weather %>%
  ggplot(aes(x=date,y=precip)) + 
  geom_point(size=2,color="lightblue",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="darkgreen") + 
  labs(x="Date",y="Precipitable Water")
wb_precip <- boston_game_weather %>%
  group_by(year) %>%
  summarise(mean_precip=mean(precip)) %>%
  ggplot(aes(x=year,y=mean_precip)) + 
  geom_point(size=2,color="darkgreen",alpha=0.5) + 
  geom_smooth(color="#512d6d",fill="lightblue") + 
  labs(x="Year",y="Mean Precipitable Water")

(wa_temp + wb_temp) / (wa_humid + wb_humid) / (wa_precip + wb_precip)

Figure 13: Plots showing the change in the climate and weather variables over time.

It appears that there may be some temporal trends in the climate and weather variables. In order to explore how the climate and weather may have an impact over time on the score, outcome, duration, or attendance of Red Sox home games, we will need to develop an appropriate model and assess its predictive accuracy.

Models

We will use different models to address several questions. First, we will use logistic regression to assess whether there is a relationship between the climate and weather variables, time, and visiting team performance and the outcome of the game (win or loss). Second, we will use linear regression to assess whether there is a relationship between the climate and weather variables, time, and visiting team performance and the number of runs scored by Boston. Finally, we will use decision trees to model the duration and attendance of games.

For the first two models, we use the bootstrap in order to estimate the sampling distributions and confidence intervals for the model coefficients. For the decision tree models, we use cross-validation to tune hyperparameters and estimate the predictive accuracy of the models.
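
The bootstrap step for the regression coefficients can be sketched as follows. This is a minimal base R illustration on simulated data with percentile intervals; it is not the code used for the figures below, and the simulated effect sizes are assumptions chosen only to make the example run.

```r
set.seed(42)
n <- 400
vis_runs <- rpois(n, lambda = 4)
tempF <- rnorm(n, mean = 65, sd = 10)

# Simulated outcome: win probability falls with visitor runs and does
# not depend on temperature.
win <- rbinom(n, size = 1, prob = plogis(1.5 - 0.35 * vis_runs))
sim <- data.frame(win, vis_runs, tempF)

# Refit the logistic regression on rows resampled with replacement.
boot_coefs <- t(replicate(500, {
  idx <- sample(nrow(sim), replace = TRUE)
  coef(glm(win ~ vis_runs + tempF, family = binomial, data = sim[idx, ]))
}))

# Percentile 95% intervals for each coefficient.
boot_ci <- apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))
boot_ci
```

A coefficient whose interval excludes zero, as for vis_runs in this simulation, is the kind of evidence the model results rely on.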

Model Results

Predictors of Wins

Figure 14 shows the bootstrap estimates for the logistic regression coefficients for the predictors of win/loss along with estimates for the 95% confidence intervals. Since the confidence intervals for date, humidity, precipitable water, and temperature all include 0, we cannot conclude that there is a relationship between these variables and the outcome of the game. However, we do observe evidence that a visiting team’s performance is negatively associated with the probability of a win for the Red Sox. While this is not a surprising result, it provides us with a quantitative estimate of the impact of the visiting team’s performance on the outcome of the game.

Figure 14: Bootstrap estimates for logistic regression coefficients for win/loss.

Figure 15 shows a series of bootstrap predictions for the probability of a win for the Red Sox as a function of the visiting team’s performance. We observe that as the visiting team’s performance increases, the probability of a win for the Red Sox decreases. The observed data is plotted along with the model predictions.

Figure 15: Bootstrap predictions for logistic regression model for win/loss.

Predictors of Runs Scored

Figure 16 shows the bootstrap estimates for the linear regression coefficients for the predictors of the number of runs scored by the Red Sox along with estimates for the 95% confidence intervals. Since the confidence intervals for humidity, precipitable water, and temperature all include 0, we cannot conclude that there is a relationship between these variables and the number of runs. The model does provide evidence that the number of runs scored by the Red Sox has increased over time, and that the visiting team’s performance is positively associated with the number of runs scored by the Red Sox. However, in both cases the effect size is quite small.

Figure 16: Bootstrap estimates for linear regression coefficients for number of runs scored by Red Sox.

Game Attendance Model Results

Figure 17: Bootstrap estimates for linear regression coefficients for game attendance.

Figure 18: Variable importance measures for predictors in tuned decision tree model for the game attendance.

Figure 19: Predicted values versus observed values for tuned decision tree model for game attendance.
Table 4: Metrics to assess model performance for tuned decision tree to predict game attendance.
Metric Value
rmse 0.1709739
rsq 0.5375165
mae 0.0987486
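
For readers unfamiliar with the three metrics, the sketch below computes them by hand on made-up values. It assumes the usual definitions, with rsq taken as the squared correlation between observed and predicted values; the numbers are illustrative and unrelated to the tables in this report.

```r
# Illustrative observed and predicted values (not from the report).
observed  <- c(3.0, 2.5, 4.0, 3.5, 5.0)
predicted <- c(2.8, 2.9, 3.6, 3.9, 4.6)

rmse <- sqrt(mean((observed - predicted)^2))  # root mean squared error
mae  <- mean(abs(observed - predicted))       # mean absolute error
rsq  <- cor(observed, predicted)^2            # squared correlation

c(rmse = rmse, rsq = rsq, mae = mae)
```

Both rmse and mae are in the units of the response, so their size depends on how the response was scaled before modeling.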

Game Duration Model Results

Figure 20: Bootstrap estimates for linear regression coefficients for game duration.

Figure 21: Variable importance measures for predictors in tuned decision tree model for the game duration.

Figure 22: Predicted values versus observed values for tuned decision tree model for game duration.
Table 5: Metrics to assess model performance for tuned decision tree to predict game duration.
Metric Value
rmse 22.6739381
rsq 0.4363977
mae 16.6034855

Conclusions

References

Douglas, Colin, and Richard Scriven. 2023. Retrosheet: Import Professional Baseball Data from ’Retrosheet’. https://CRAN.R-project.org/package=retrosheet.
“Fenway Park - Wikipedia.” En.wikipedia.org. https://en.wikipedia.org/wiki/Fenway_Park.
Kanamitsu, et al. 2002. “NCEP/DOE AMIP-II Reanalysis (r-2).” Bull. Amer. Meteor. Soc. 83: –12.
“List of Current Major League Baseball Stadiums - Wikipedia.” En.wikipedia.org. https://en.wikipedia.org/wiki/List_of_current_Major_League_Baseball_stadiums.
“Step Inside: Fenway Park - Home of the Red Sox - Ticketmaster Blog.” Blog.ticketmaster.com. https://blog.ticketmaster.com/step-inside-fenway-park-boston-ma/.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Sonoma 14.0
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2023-10-15
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
 quarto   1.3.450 @ /usr/local/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 package      * version date (UTC) lib source
 broom        * 1.0.5   2023-06-09 [1] CRAN (R 4.3.0)
 dials        * 1.2.0   2023-04-03 [1] CRAN (R 4.3.0)
 dplyr        * 1.1.3   2023-09-03 [1] CRAN (R 4.3.0)
 forcats      * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
 GGally       * 2.1.2   2021-06-21 [1] CRAN (R 4.3.0)
 ggplot2      * 3.4.4   2023-10-12 [1] CRAN (R 4.3.1)
 infer        * 1.0.5   2023-09-06 [1] CRAN (R 4.3.0)
 kableExtra   * 1.3.4   2021-02-20 [1] CRAN (R 4.3.0)
 knitr        * 1.44    2023-09-11 [1] CRAN (R 4.3.0)
 lubridate    * 1.9.3   2023-09-27 [1] CRAN (R 4.3.1)
 modeldata    * 1.2.0   2023-08-09 [1] CRAN (R 4.3.0)
 parsnip      * 1.1.1   2023-08-17 [1] CRAN (R 4.3.0)
 patchwork    * 1.1.3   2023-08-14 [1] CRAN (R 4.3.0)
 purrr        * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
 RColorBrewer * 1.1-3   2022-04-03 [1] CRAN (R 4.3.0)
 readr        * 2.1.4   2023-02-10 [1] CRAN (R 4.3.0)
 recipes      * 1.0.8   2023-08-25 [1] CRAN (R 4.3.0)
 rsample      * 1.2.0   2023-08-23 [1] CRAN (R 4.3.0)
 scales       * 1.2.1   2022-08-20 [1] CRAN (R 4.3.0)
 sessioninfo  * 1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 skimr        * 2.1.5   2022-12-23 [1] CRAN (R 4.3.0)
 stringr      * 1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 tibble       * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 tidymodels   * 1.1.1   2023-08-24 [1] CRAN (R 4.3.0)
 tidyr        * 1.3.0   2023-01-24 [1] CRAN (R 4.3.0)
 tidyverse    * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
 tune         * 1.1.2   2023-08-23 [1] CRAN (R 4.3.0)
 workflows    * 1.1.3   2023-02-22 [1] CRAN (R 4.3.0)
 workflowsets * 1.0.1   2023-04-06 [1] CRAN (R 4.3.0)
 yardstick    * 1.2.0   2023-04-21 [1] CRAN (R 4.3.0)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────

Footnotes

  1. By “open-air” we mean permanently open with no option for roof cover. Stadiums with either a fixed or retractable roof are not considered to be open-air for our purposes.

  2. Precipitable water is the amount of water potentially available in the atmosphere for precipitation, usually measured in a vertical column that extends from the Earth’s surface to the upper edge of the troposphere.